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TOOLS FOR PREDICTING RAINFALL FROM LIGHTNING RECORDS: 
EVENTS IDENTIFICATION AND RAIN PREDICTION USING A 
BAYESIAN HIERARCHICAL MODEL 

EDMONDO DI GIUSEPPE, GIOVANNA JONA LASINIO, MASSIMILIANO PASQUI, 

AND STANISLAO ESPOSITO 


Abstract. We propose a new statistical protocol for the estimation of precipitation us¬ 
ing lightning data. We first identify rainy events using a scan statistics, then we estimate 
Rainfall Lighting Ratio (RLR) to convert lightning number into rain volume given the 
storm intensity. Then we build a hierarchical Bayesian model aiming at the prediction of 
15- and 30-minutes cumulated precipitation at unobserved locations and time using infor¬ 
mation on lightning in the same area. More specifically, we build a Bayesian hierarchical 
model in which precipitation is modeled as function of lightning count and space time vari¬ 
ation is handled using specific structured (random) effects. The mean component of the 
model relates precipitation and lightning assuming that the number of lightning recorded 
on a regular grid depends on the number of lightning occurring in neighboring cells. We 
analyze several model formulations where storms propagation speed, spatial dependence 
and time variation incorporates different descriptions of the phenomena at hand. The 
space-time variation is assumed separable. The study area is located in Gentral Italy, 
where two storms, that differ for duration and intensity, are presented. 


1. Introduction 


Several methods and algorithms have been developed for estimating precipitation from 
satellite data. These methods can be categorized according to the type of sensors used. In¬ 


frared IR-based algorithms use the cloud-top brightness temperature (Arkin and Meisner 


1987), radar-based use passive microwave recorded from satellite radiometer (Ferraro and 


Marks, 1995 Kummerow et ah, 2001; Severino and Alpuim, 2005; Li et ah, 2008) and a 

(pooTl), 


(2000), Todd et al. 


combination of IR and satellite radar are proposed in Turk et al. 

Joyce et al. (2004), Ushio et al. (2009). However, rainfall fields estimates derived from these 


sensors have many limitations. More specifically, IR-based algorithms have deficiencies in 
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estimating rainfall fields because they do not allow to ’’differentiate thin non precipitating 
cirrus clouds and non raining cold mesoscale convective system (MCS) cloud shields” from 
precipitating clouds (Morales and Anagnostou 2003). On the other hand, satellite radar 


estimates over a specific geographical area present a significant overpassing time period and 
thus limiting the acquisition frequency (Huo et ah, 2014). Combined algorithms reduce 


such limitations though they are still biased in mountain areas and mid-latitude regions. 
The methods used by those algorithms to produce a spatial distribution of precipitation 
estimates are generally based on linear regression. For instance, Arkin and Meisner (1987) 


obtain rain estimates by means of a linear relationship between cold cloud and rainfall; Fer¬ 


raro and Marks (1995) adopt a radar-rainfall linear relation on the log scale; Kummerow 


et al. (2001) apply a linear regression using an empirical index that links hydro-meteors 


to IR channels; Joyce et al. (2004) produce estimates by a time-weighted interpolation on 
microwave derived precipitation. On the other hand, others algorithms use spatio-temporal 
models in order to reproduce a rainfall spatial surface: Severino and Alpuim (2005) use geo- 
statistical models combined with autoregressive components to obtain estimates from rain 
gages and radar measurements; Li et al. (2008) perform a calibration of radar measurements 
using a combination of threshold estimation, bias reduction, regression and geostatistical 


techniques that account for season, rainfall type, and rainfall amount. Ushio et al. (2009) 


produce estimates of the surface rainfall rates as a combination of the infrared brightness 
temperature by the GEO-IR satellites and microwave radiometer estimates, by means of 
Kalman filter technique. 

Many attempts to improve satellite precipitation estimate have been done by using 
lightning data. For example various algorithms based on a combination of IR, lightning 


and radar have been developed ( 

Goodman et al., 1988 Grecu et al., 2000; Morales and 

Anagnostou 2003 

Chronis et al. 

2004). In t 

hese works, lightning are used as a variable 
s temperature (of clouds)-rainfall relation is 

for cloud patches classification, then brightnes 

estimated for each clouds group. For instance, 
regions with lightning observations that are su 

Morales and Anagnostou (2003) individuate 

3sequently used to define the degree of active 


convection in a cloud. Then, they determine the rainy area fraction on the base of the linear 
relationship with brightness temperature and use it for assigning rain rates to convective 
area. Grecu et al. (2000) use the count of lightning to classify rainy convective systems, 
then they employ a bivariate linear regression to determine the cluster rain volume. They 
demonstrate that a combination of lightning and IR brightness temperature could reduce 
by 15% the error variance of rain volume estimates. On the other hand, Betz et al. (2008) 
use lightning data for tracking rainy cells inside convective events and show that this is 
particularly helpful in anticipating the development of severe weather and strong storm 
cells. The natural evolution of this work is to associate a rainfall rate to tracked cells. 

Another approach involving lightning data for rainfall estimation is based on the rainfall 
lightning ratio (RLR), i.e. an estimate of the amount of precipitation to be associated 
to each flash. In this approach lightning data are used for the identification of rainy 
convective areas and for calculation of the corresponding lightning-rainfall relation, and 
then to forecast the precipitation amount in proximal areas and nearest times of single 


flashes. Among others, the most interesting proposals are those of Gheze and Sauvageot 
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(1997), Petersen and Rutledge (1998), and Tapia et al. (1998). Cheze and Sauvageot (1997) 
describe the rainfall amount as a function of lightning activity for areas of about 40000 km^ 
and 15-minutes time resolution. Petersen and Rutledge (1998) link the lightning-rainfall 


relation to the type of convection, Tapia et al. (1998) b uild a model for rainfall estima tion 
on an area of 2b km? and 5-minutes resolution. Finally, Tuomi and Larjavaara (2005) use 
a ” CellSearch” clustering procedure that group flashes to identify thunderstorms and their 
lifetime characteristics after a careful algorithm tuning. Notice that none of these papers 
deals with the uncertainty associated to rainfall-fields estimation.Our model builds upon 
the RLR approach. In particular we start from the work of Tapia et al. (1998) and following 
three steps we propose a Bayesian spatio-temporal hierarchical model to obtain rainfall- 
fields estimates. The three steps are: (i) identihcation of rainfall convective events through 
a scan statistic similar to the one propose in Tuomi and Larjavaara (2005); (ii) estimation 
of rainfall lightning ratio distinguishing among different classes of storms size. In other 
words from identified convective events we estimate the rainfall lightning ratio taking into 
account the magnitude of the convective event in terms of total lightning counts; and (iii) 
rainfall estimation from lightning records through a Bayesian spatio-temporal hierarchical 
model including an evolution of the basic Tapia-Smith-Dixon model into the mean term of 
the stochastic model. 

We propose a space-time hierarchical Bayesian model in which precipitation are modeled, 
on the log scale, as function of lightning counts and space time variation is handled using 
specific random effect. We envision a model in which the rain at cell p of a regular grid, at 
time t is described by a latent random process adjusting for several specific issues of the 
available data. Our final goal is to build a predictor for the underlying spatial surface of 
the latent variable. Furthermore, the adoption of a stochastic approach let us measure the 
error associated to predictions. Notice that in what follows we define convective event an 
area where one or more convective systems are present for a given amount of time. The 
study area is located in Central Italy and the chosen case study events are the storms of 
August 5, 2004 and May 9, 2006, for both we analyze two time scale of data aggregation: 
15- and 30-minutes. The model is implemented using JAGS ([Plummer 


2003). 


The database and two case study are presented in Section where we describe both step 
(i) and (ii), i.e. the scan statistic procedure used to identify convective events and the RLR 
estimation. In Sectionj^we introduce the model implementation and the estimation of both 
posterior parameters distribution and posterior predictive distribution (step (hi)). Section 
[^reports estimation outputs and an evaluation of model predictions and a comparison with 
satellite-derived estimates. Finally, discussion and concluding remarks are in Section 


2. Available data and case-studies 

2.1. Data. The study area is identified by the geographical coordinates 41°-45° LatN, 
9.5°-14° LonE. The database is obtained by merging three databases covering the time 
span March-September 2003-2006 and reporting lightning instantaneous records, satellite 
hourly precipitation fields on a 10x10 km regular grid and the weather stations sub-hourly 
precipitation records. 
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Lightning data report the location and date of all registered cases of Cloud to 
Ground (CG) lightning within the study area. We use CESI-Sirf (2010) data¬ 


base acquired by Consorzio Lamma-Regione Toscana and Cnr-Ibimet (Consorzio 


Lamina, 2010). The CESI-Sirf lightning detection network is formed by 16 sensors 


positioned on the Italian peninsula and 8 sensors located in Austria, France and 
Switzerland. The lightning signal is an electromagnetic field detected by at least 
3 sensors and, successively, cleaned from the background noise. The optimal dis¬ 
tance between sensors is about 200 km although, they have a good performance 
up to 500 km. The sensors are of type IMPACT with broadband electromagnetic 
antennas and GPS synchronization (Global Atmospherics Technology Inc.). 

Global satellite precipitation data are distributed by the project GSMaP by the 
Earth Observation Research Center, Japan Aerospace Exploration Agency (JAXA). 
The project GSMaP is sponsored by JST-CREST and promoted by the JAXA Pre¬ 


cipitation Measuring Mission (PMM) Science Team (Okamoto et ah, 2005; Kubota 


et al., 2007; [Aonashi et ahl 2009).The GSMaPMVK+ dataset has a grid resolution 


of 0.1 degree Lat/Lon and a temporal resolution of 1 hour. The estimates of the 
surface rainfall rates are obtained as a combination of the infrared brightness tem¬ 
perature by the GEO-IR satellites and microwave radiometer estimates, by means 


of Kalman filter technique (Ushio et al., 2009). 


Point precipitation data are composed of sub-hourly observations time series coming 
from 316 weather stations located within the study area (Consorzio Lamma 2010). 
However, we removed all rain gages that presented more the 10% of missing values 
on the 15 (30)-minutes time scale. Then, the number of selected rain gages reduce 
to 171 for the event of August 5, 2004 of which 159 returns 15-minutes cumulate 
rainfall recordings and 12 30-minutes cumulate rainfall recordings. For the event of 
May 9, 2006 we have 181 rain gages of which 179 returning 15-minutes recordings 
and 2 collecting 30-minutes cumulate rainfall. 


Notice that rain gage, lightning and satellite recordings time coding differs. In fact, the 
gages rainfall recorded at time t indicates the rain accumulated during the time interval 
(f — l,t] whilst the satellite rainfall record as well as the number of flashes at time t 
represents the rain accumulated and the number of flashes recorded during the time interval 
[t,t + 1), respectively. Hence, the three databases require spatial and temporal alignment 
to be used jointly. 

In what follows we superimpose the grid defined by satellite data, i.e. a regular grid 
with 10 X 10 km cells to the study area. This grid will be our reference for the spatial 
alignment of data. Thus, we refer to the number of lightning counted in each grid cell and 
we aggregate recording of rain gages belonging to the same grid cell by taking their median 
value. 

Finally, the space-time support of two case study is: 100 (104) cells and 36 (18) time 
units for August 5, 2004 15 (30)-minutes event and 111 (112) cells and 68(34) time units 
for May 9, 2006 15 (30)-minutes event. The final database is composed of lightning and 
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rainfall records registered in 3600 (1872) and 7548 (3808) space-time units for August 5, 
2004 15 (30)-minutes and for May 9, 2006 15 (30)-minutes, respectively. 

2.2. Identification of convective events and estimation of the RLR: Step (i) and 
Step (ii). To properly model lightning-rain relation we need to identify single convective 
events. This complex operation is usually done by analyzing the atmospheric circulation by 


means of satellite images or other large scale instruments. For instance, Adler and Negri 


(1988), and Anagnostou and Kummerow (1997) have used the cloud top temperature 


obtained by means of infrared satellite data (IR). In practice, they separated convective 
from stratiform regions in the cloud system by means of IR brightness temperature. On 


the other hand, Petersen and Rutledge (1998) determined convective regions analyzing the 
correlation between total lightning flash rates and convective rainfall. Here, we propose 
to simplify the operation by using the time sequence of lightning records to this aim and, 
particularly, the use of a scan statistic procedure proposed by Ester et al. (1996). The 
procedure includes three steps: 

1) we identify a daily signihcant lightning aggregation using the marginal distribution 
of hourly lightning counts over the gridded study region; 

2) in order to capture events taking place around the boundary between two subse¬ 
quent days our time-window extends from 6 pm of the previous day to 6 am of the 
next, allowing for a 6 hours overlap between adjacent days. For instance, referring 
to the 9th of May 2006 as a day with a great lightning activity, we extend the 
analysis from 6 pm of May 8, 2006 to 6 am of May 10, 2006; 


3) we apply the scan statistic procedure proposed by Ester et al. (1996) in order to 


identify the beginning and the end of each convective event generated inside the 


chosen time-window. The procedure proposed by Ester et al. (1996) hrst locates a 


sphere of radius r on one record at location (x, y) at time t and includes all records 
within the chosen radius, if inside the sphere a minimum number of lightning ni is 
recorded, the record is included in the event. 

Notice that the scan statistic depends on the distance among points with coordinates 
in our study given by the spatial ones and time. Geographic coordinates are expressed 
in UTM (Universal Transverse Mercator) system, zone 32 and time is an integer, the 
three dimensional coordinate system has components of very different order of magnitude 
then we choose to scale them (centering with respect to the mean value and dividing by 
standard deviation). To choose the value of scan statistic radius we compute it using 
r = 0.1,0.2,0.3, 0.4 always with ni = 10 following empirical consideration on the lightning 
physics. The smallest r values return a too fragmented description of the event, while 
r = 0.4 aggregates clearly separated events. Hence we hx r = 0.3. The dbscan algorithm is 


included in the R package fpc (Henning, 2010) and it is a simplihed version of the original 
algorithm which is based on K-dist criterion (Ester et ah, 1996). For the purpose of 


clustering different convective events which are eventually occurring during the same day. 


we also experimented the spatgraphs R-library (Rajala, 2012), which compute a general 
adjacency of a given point pattern. We set a geometric adjacency determining a spatial 
clustering of lightning by means of a connection radius. However, the clustering obtained 
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by applying this alternative method presents some limitations for several tested convective 
events, the main being the clustering of lightning falling in the same area but at very 
different time points. 

The scan statistic procedure identifies 767 events with at least 50 CG-lightning and the 
largest event counts 33364 lightning. We define four categories of convective events based 
on the total number of lightning: Small, Medium, Large and Very Large events. These 
categories are described in Table where the number of cases in each class is shown. 

Table 1 . Classes of dimensionality of convective events defined on the basis 
of lightning number. 


Dimension 

Number of Lightning 

Number of cases 

Small 

< 170 

403 

Medium 

(170,900] 

270 

Large 

(900, 8000] 

84 

Very Large 

> 8000 

10 


Once we identify single convective events and their magnitude we can improve on the 
estimation of the RLR as proposed in Tapia et al. (1998). The RLR determines the mass 
of rain associated to each flash, this mass is expressed in kg m~‘^. In general, the RLR 
depends very much on the thunderstorm type and region. Quantitative estimations of RLR 
have been proposed in several studies (see Soula and Chauzy, 2001, and references therein). 
The simplest estimator Z of RLR is an average of the RLR estimated for each convective 

event, i.e. Z = —— where is the total number of convective events identified 

Alfi 

and Zp is defined as: 


' ESiEpeQ^(^>P) 

where L{t, p) is the number of lightning recorded at time t and cell p, Q is the entire area 
covered by the event and r^^'^{p) is the satellite precipitation accumulated at cell p during 
the 1-hour h time interval. Notice that the time interval is 1-hour since this is the time 
scale of satellite database, with h = 1, - ■ ■ ,Th being the duration of the event in hours. 

Our proposal is to estimate a different value of the RLR for each class of events size: 
Small, Medium and Large as in Table where Large and Very Large are merged together. 
We propose a Z estimate that is based on satellite data rather than rain gages data, as 
the latter are not enough to cover the entire study area. In particular the RLR Ze is 
calculated as in Eq. [T] dividing the total volume of precipitation registered in the area 
by the total number of lightning registered in the same area, and rain gages sparsity do 
not let an accurate computation of the numerator. Weather stations data can be used to 
build empirical correction factors accounting for the tendency of satellites to underestimate 
rainfall rates and a consequent excess of zero-grid cells in the study area. Here we define 
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two multiplicative correction factors; fi = is the average difference 

between rainfall recorded at weather stations paired with the corresponding satellite record; 
/2 = j^STAT ^'"sAT where are the total number of observations from satellite 

iV '^zero 

and weather stations and are the corresponding zero counts. Hence we obtain 

for convective event size: 


( 2 ) 



Y.h-f2-Zt 


RLR estimated on our data are reported in Table where RLR values are expressed in 
IqS pgj, CG-flash. 


Table 2. Corrected RLR estimates for 3 classes of dimensionality: Small, 
Medium and Large convective events. 


RLR (10^ m^) 

Small events 

Medium events 

Large events 

Entire 

Median 

0.1 

0.6 

8.6 

0.2 

Mean 

0.4 

2.6 

24.1 

4.0 

St.Dev. 

1.1 

8.7 

32.3 

14.5 


2.3. Two case study. In what follows we model two convective events occurred during 
the storms of August 5, 2004 and May 9, 2006. The first convective event started at 
11.24Sim. and ended at 7.56 pm on the 5th of August 2004 and the second event started 
at 00.01am and ended at 4-14 P™ on the 9th of May 2006. These events are identified 
as described in Section |2.2[ A map of the two events is reported in Figure where a 
polygon (black dot-dashed line) delimits the area covered by each convective event and 
lightning spatial propagation (light yellow), satellite precipitation (continuous blue scale) 
and observations from rain gages (black triangles) are shown. 

From a phenomenological point of view, the two events are quite different since the 
event of August 5, 2004 is a combination of two type of systems merging together: one is a 
propagating system generated over Ligurian Sea and entering towards inland, the other is 
a typical afternoon thermo-convective system generated by orography, whereas the event 
of May 9, 2006 belongs to a propagating system that enters from Ligurian Sea and stops 
at Appenini’s Mountain chain. The size of the covered areas, 93147 for August 5, 
2004 and 54131 krn^ for May 9, 2006 differs substantially. August 5, 2004 event’s centroid 
is located at 11.4 degrees Longitude East and 43.1 degrees Latitude North and the May 
9, 2006 centroid is at 11.4 degrees Longitude East and 43.8 degrees Latitude North. The 
two events differ from each other also because of the total number of lightning generated 
since the first registered, 18140 CG-flashes during 9 hours of duration in August and 3163 
CG-flashes during 16 hours of duration in May. According to the definition given in Table 
the convective event of August 5, 2004 can be considered as a Very Large event whereas 
that of May 9, 2006 belongs to the category of Large event. 
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10“E 11-E 12°E 13°E 

Longitude 


(b) 

Figure 1. The area covered by convective events of August 5, 2004 (a) 
and May 9, 2006 (b): polygon (dotted-dashed line) delimiting the event, 
lightning (light/grey), satellite precipitation (continuous blue scale) and 
rain gages observations (triangles). 
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Table 3. Summaries of non zero precipitation records of 15- and 30- 
minutes cumulated rainfall for the two case study. 


Event 

Min. 

1st Qu . 

Median 

Mean 

3rd Qu. 

Max. 

August 5, 2004 15-min 

0.2 

0.2 

0.6 

1.386 

1.2 

34.6 

30-min 

0.2 

0.4 

0.8 

2.194 

2 

58.8 

May 9, 2006 15-min 

0.2 

0.2 

0.6 

1.299 

1.6 

15 

30-min 

0.2 

0.2 

1 

1.971 

2.6 

17.6 


We analyze two different cumulate rain: (i) the 15-minutes cumulate that strictly fol¬ 
lows the temporal evolution of lightning within a storm, this measure is affected by a large 
variability that reflects on the model estimates; (ii) the 30-minutes cumulate that, on the 
other hand, removes a large part of the data variability, but might miss some relevant 
features of the storm evolution. A comparison of basic summaries of positive precipitation 
shown in Table [3] confirms the different nature of the two events: the mean volume of 
precipitation calculated as the fraction of total rainfall estimated from satellite data in the 
area during 15 (30)-minutes on number of cells is 1.386 (2.194) mm for August 5, 2004 and 
1.299 (1.971) mm for May 9, 2006; the maximum value registered for the 15 (30)-minutes 
cumulate is 34.6 (58.8) mm for August 5, 2004 and 15 (17.6) mm for May 9, 2006. 

The difference between the two events are stronger in terms of extremes of the rainfall 
distributions (3rd quartile and higher). Indeed the August 5, 2004 event has larger 4th 
quartile than the May 9, 2006 storm, suggesting larger rain rates. Furthermore, the light¬ 
ning activity during the peak on the 15-minutes time scale differs substantially since 1236 
flashes have been registered in August 5, 2004 whilst only 136 flashes have been registered 
in May 9, 2006. Notice that a strong difference in the physical structure as well as in the 
spatial pattern of the two study events allows us to test the model performance under very 
different circumstances. 

In Figure the time dynamic of lightning and rainfall rates at the 15- and 30-minutes 
time scale is described using the total number of lightings recorded at cell where a rain 
gage is located and the total rainfall recorded by the latter. The two events differ in their 
dynamic in the first part of the event time span, after the peak of the storm they behave 
in a very similar way. 

Remark that rainfall data are affected by several problems, on one hand a large number 
of zeros is recorded in both time scales, on the other hand rain gages precision (between 
0.1 and 0.2 mm) implies an almost discrete measurement of accumulated rain as shown in 
Table This issue must be taken into account in the modeling of rain gages recording. 


3. The model: : Step (hi) 

In view of the considerations at the end of previous section we assume that the observed 
cumulate rainfall at time t and location p = {px,Py) is a partially discrete stochastic 
process H{t, p) such that there exists a continuous latent process X{t, p) linked to H{t, p) 
as follows (|Sahu et'al, 2005): 
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Total lightning and accumulated rain in 100 cells for 9 hours 
5-Aug-2004 



Total lightning and accumulated rain in 104 cells for 9 hours 
5-Aug-2004 



(a) 


(b) 


Total lightning and accumulated rain in 111 cells for 16 hours 

9-May-2006 



Total lightning and accumulated rain in 112 cells for 16 hours 

9-May-2006 



(c) 


(d) 


Figure 2. Time series of rain (blue) and lightning number (light gray) at 
two time scales 15- and 30-minutes for 5 Aug, 2004 a-b) and for 9 May, 2006 
c-d). 


A2 


if X{t, p) < Cl 
if Cl < A(t, p) < C2 


XI if Cfc-i < A(t,p) < Cfc 
X{t,p) if X{t,p)>Ck 


( 3 ) 


H{t,p) = < 
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Table 4. Frequency distribution of rainfall values observed on 5th of Au¬ 
gust 2004 and 9th of May 2006 convective events at 15- and 30- minutes 
time aggregation. 


Rain classes (mm) 

[0,0.2) 

[0.2,0.4) 

[0.4,0.6) 

[0.6,0.8) 

[0.8,1) 

[1,5) 

> 5 

5Aug2004-15min 

2384 

380 

202 

136 

102 

323 

73 

% 

66.2 

10.6 

5.6 

3.8 

2.8 

9 

2 

5Aug2004-30min 

1079 

155 

98 

89 

54 

312 

85 

% 

57.6 

8.3 

5.2 

4.8 

2.9 

16.7 

4.5 

9May2006-15min 

5949 

451 

202 

134 

125 

625 

62 

% 

78.8 

6 

2.7 

1.8 

1.7 

8.3 

0.8 

9May2006-30min 

2746 

254 

120 

69 

54 

446 

119 

% 

72.1 

6.7 

3.2 

1.8 

1.4 

11.7 

3.1 


We assume that there exist k = 5 values X* = exp(Aj) — l,i = 0,... ,4 described in 
Table that occurs with positive probability. We choose the to be the mid point of 
the discretization intervals. In the sequel we are going to model the latent process on the 
log scale (V(t,p) = log(X(t, p) + l|j ) rather than the observed process II(t,p). This 


transformation is chosen mostly to smooth the impact of strong rainfall intensities (Lee 


and Zawadzki, 2005) and to allow a more sensible adoption of Gaussian representation. 


This simple method proposed by Sahu et al. (2005) and Jona Lasinio et al. (2007) let 


us to obtain a good performance of the model in predicting zero rainfall events. Other 


methods can be adopted to treat zero inflated rainfall distributions, such as in Berrocal et al. 


(2008), Fuentes et al. (2008) or Schmidt and Migon (2009). Berrocal et al. (2008) specify a 


spatial model that includes two spatial Gaussian processes driving precipitation occurrence 
and accumulation, respectively. A spatial-temporal model for rain gages and reflectivity 


radar data is developed in Fuentes et al. (2008), where a latent process corresponding to 


the true rain amount drives the probability of precipitation occurrence and the rainfall 
accumulation. On the other hand, Schmidt and Migon (2009) treat observations from 
rain gages as generated by a latent process which realizations are a mixture of a Bernoulli 
distribution that specifies the probability of having positive precipitation and a probability 
density function for the rainfall accumulation, typically an exponential, a gamma or a log¬ 
normal distribution. Notice that we are working on a hner time scale than the above cited 
authors that mostly analyze daily and even weekly data. 

We assume that 


(4) Y{t, p)|0, W ~ p) -h w{t, p), 

where ^{t,p) is a function describing the relation between rainfall and lightning, w{t,p) 
is a zero mean separable space-time process, is the process variance, and 6 is the set 


^We add I to preserve zero values. 
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Table 5. Discretization values for the latent rainfall field on the log scale Y{t,p). 


Rain Classes (mm) 

Discretization values 

[ 0 , 0 . 2 ) 

Ao = log( 0.1 -h 1 ) 

[0.2,0.4) 

Ai = log(0.3 + 1) 

[0.4,0.6) 

A 2 = log(0.5 -h 1) 

[ 0 . 6 , 0 . 8 ) 

As = log(0.7 -h 1) 

[ 0 . 8 , 1 ) 

A 4 = 1 log(0.9 -h 1) 


of all model parameters. In the following paragraphs we are going to describe each model 
component. 


3.1. The mean. To build the model for the process mean we start from the Tapia-Smith- 
Dixon deterministic model (Tapia et ah, 1998). The Tapia-Smith-Dixon model is basically 
a spatio-temporal prediction of rainfall rate based on CG-lightning patterns and RLR 
estimate, and is formalized as follows: 

Nt 

(5) Xit,s) = CY,Zfit,Ti)g{s,S,) 

i=l 

where X(t,s) is again the observed rainfall rate (mm/h) at time t and spatial location 
s, C is the units conversion factoiQ Nt is the number of flashes until time t + At/2, Z 
is the rainfall-lightning ratio introduced above, Ti is the time at which the ith flash is 
recorded, S* is the location of the zth flash, f{t,Ti) specifies the rainfall flux at time t 
determined by a lightning flash at time Ti and g{s, Si) specifies the rainfall flux at location 
s when a lightning flash is observed at location Sj. Both temporal /(t, Tj) and spatial 
g{s, Si) functions are taken to be uniform over the interval {t — 5,t + 5) minute and within 
a circle of 5 km radius around location s, respectively. These assumptions according to 
the authors are simple and represent a first step to build on. We have already built on 
the approach of Tapia-Smith-Dixon by introducing estimates of the RLR that depend on 
the size of convective events (see Eq. (©)• We introduce here the results obtained by 


applying the modified version of Tapia-Smith-Dixon model (Tapia et ah, 1998) through 
Eq. ([^ on the entire set of events identified by scan statistics. Thus, for each event, we 
estimate precipitation fields at every cell of the spatial domain for the entire duration of 
the event. This extension of the work is a valid test to evaluate the efficiency of proposed 
RLR estimator. The evaluation of this estimation procedure is done using the information 
from rain gages. More specifically, for the 767 convective events identified by scan statistic 
we select those cells where at least one rain gage is settled in, then we compare rainfall 
fields from GSMap satellite data and values computed by implementing mean values of 
RLR estimated by Eq. ([^ into Eq. with the corresponding rain gage observations. 


^Notice that RLR is expressed in kg m~^. whereas the precipitation recorded by rain gages or estimated 
from satellite data is expressed in mm m~^. Consequently, we need to transform a mass into a volume 
applying a conversion factor C = where A is the event area in square kilometers. 
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RMSE(mm) 

RLR=0.4 

RLR=2.5 

RLR=24.1 

Estimated 

15.3 

24.9 

15.9 

GsMap 

13.0 

22.8 

20.7 

Occurences 

535 

1248 

2780 


Table 6 . Root Mean Square Error of satellite precipitation (GsMap) and 
estimated fields compared to rain gage observations for 3 category of Small, 
Medium and Large events. Occurrences are the number of cells with a valid 
record. 


RLR Large events 


POD 

POFD 

Estimated 

81.3 

15.8 

GsMap 

76.1 

15 


Table 7. Comparison of Probability of Detection (POD) and Probability 
of False Detection (POFD or FAR) indexes between Estimated and GsMap 
values for Large events. 


The Root Mean Square Errors (RMSE) values are reported in Tablewhere ’’occurrences” 
are the number of cells with a valid record coming from rain gages. 

The analysis of RMSE values reveals that the adoption of our Z{d) estimate let us to 
obtain an important improvement with respect to GsMap satellite data for Large convec¬ 
tive events. Furthermore, the probability values of hitting rainy cases, i.e. with rainfall 
greater than 0.2 mm and Probability of False Detection (POFD)|^ confirm this procedure 
improves the performance of satellite data. 

Now recall that we are modeling Y (t, p) = log(X(t, p) +1) the log-latent rainfall process 
that will be analyzed using two different cumulated amounts: 15- and 30-minutes. We 
consider time series of length T on a regular grid of = ni x n 2 cells. We denote by 
L(t, p) the number of lightning at time t and location p. We start by adapting Eq. © to 
estimate rainfall amount: 

T 

(6) A^"«(t,p) = (CM;i)Z''^ Li,p/(LT,)5(p,P.) i = l,2,---,r 

i=l s^Np 

where is the (latent) rainfall amount predicted at cell p and time t; is the 

observed cell-location; Tj is the observed time; Li p is the number of lightning cumulated 
at the end of a given time interval i at cell p; Np is the set of cells belonging to the 


^Details on the meanings of POD and POFD are given further in Table [ll| 
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flasheshrain 



Figure 3. Temporal evolution of lightning (black) and associated rain 
(blue) within a convective event. The 3 stages of evolution are also in¬ 
dicated: Charging phase (Ch), Mature state (Ma) and Dissipating phase 
(Dis). 

neighborhood of cell p; is the estimated Rainfall Lightning Ratio distinguishing the 
event dimension d=Small, Medium, Large as defined in Table C*A~^ is a conversion 
factor from lO^kg m~‘^ (mass) to mm m~^ (volume) with Ap being the area of any cell in 
square meters and C* a dimensionality adjusting factor; f{t, Ti) is a time weights function; 
g'(p,Ps) is a spatial weights function. Time and spatial weight functions are built on 
the basis of both time evolution and spatial propagation of lightning. In particular, the 
life of lightning pattern inside a single rainfall convective system is composed of 3 stages: 
Charging phase (Ch), Mature state (Ma) and Dissipating phase (Dis) (see Figure. [^. Here 
we analyze a convective event that is composed of several single systems, to include the 
phenomenon evolution in the modeling process, we mimic over the entire event the behavior 
of the single one. This idea follows the events description given in Figure 

More precisely we partition the event duration into [to,Tch), [Tch,TMa) and [TMa,T]. 
Then, the time weight function is built from Figuresuch that: 


( 7 ) 




fch{t) if tkTi^Ch 

fMa,Dis{t) if t kTi€ MaljDis 


where Tch indicates the end of the Charging phase and T is the duration of the entire event. 


In practice, we assume that the Mature and Dissipating phases are equivalent in terms of 
the lightning temporal evolution, than we have two stages [to,Tch) and [Tch,T). Notice 
that if the predicting time t and observed time Tj are in different phases of event’s lifetime, 

that is both events |{t G Ch} H {Tj G MaUT>is}| and G MaUDis} H {Tj G C/i}|, 

then they receive zero weight, i.e. fch-,Ma,Dis{t) = 0. This assumption is basically made 
to avoid a strong correlation between model predictions that are distant in time from 
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Figure 4. Neighborhood of a generic cell p (blue): The first order structure 
is composed by the green colored pixels whilst the second order structure is 
obtained by adding the red pixels. 


each other. However, the assumption becomes too strong when dealing with instants of 
observation and prediction that are both close to the Tch time. To find a solution for 
this drawback, we analyze the performance of two model modifications in which the cell’s 
activity is treated without linking it to a specific event’s phase. These modifications are 
discussed below. 

To build the space weights function we assume that the number of lightning in cell p 
depends on the number of lightning occurring in neighboring cells. Then we define a 
neighborhood structure, N(p) to handle such dependency. In the following we choose N{p) 
to be a second order neighborhood of cell p (see Figure]^, then 


( 8 ) 


5(p,Ps) 


uji^s if Ps G N{p) 
0 if P, ^ Nip) 


where uji^s = l/8Tj^p^, is the number of lightning recorded in the observed cell Pg at 
time i and the proportional factor 8 represents the maximum number of neighbors of cell 
p for the chosen neighborhood structure. In this formulation, the spatial weight between 
neighboring cells is: 


(9) 


Tip 

^ - 1 " 

L/p 


1 

- * 
8 


^i,Np 

Lp 


where Li^p is the number of lightning at predicting cell p and time i, is the summation 
of lightning over the neighborhood cells of p at each time i (i.e. ^iPs) — 

Yli=i {^i,p + ^i,Np) is the total number of lightning hitting cell p for the entire duration of 
the event. 

Given the physics behind the propagation of convective events, we are going to consider 
several modifications of the mean term of the model. 

Ml: This is the simplest model among those proposed. It includes a very simple form of the 
temporal weights, that mimic the curves combined in Figure To simplify the notation. 
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let US write C = {C*Ap^) * where C is estimated outside the model as described in 
section 2.2 Thus, the fixed component of the mean is: 


= log 


( 10 ) 


exp 


2=1 

I f - 7) 


' At ' 

+ C uji^p + 1 


7[o,rc,]W+exp<^ - 


,t — Ti 


i |2 


At 


hTch,T]{t) 


2 = 1 


where /[o^Tch](^) 7[rc/i,T](^) indicator functions and is the spatial weight 

specified in Eq. ([^. 

M2: Here we include the velocity of propagation of the convective event into the temporal 


weights. We assume this velocity to be a fixed constant (16.1 m/s) as suggested in Levizzani 


et al. (2010) for convective events spanned within 1000 km and up to 20 hours of duration. 


Consequently, the time weight functions of Eq. becomes: 

{a + bV) .t-Ti 


fch{t) = exp<^ - 


fMa,Disit) = exp<^ - 


A. 


1/2 


At 


(a + bV) .t -Ti .2 


A. 


1/2 


At 


to < t & Ti < Tch 
Tch < t Sz Ti < T 


where V is the velocity of propagation and Ap is the area of a single cell. The mean of 
model Ml is: 


//(t,p) = log 


2 = 1 


exp 


(a + bV) t-Ti 

D/2 I I f + ''''P 


[a + bV) t - Tj |2 ^ ^ ^ 

^^1 (TTch,T]it> 


A. 


1/2 


+ C coi^p + 1 


2=1 


where /[o^Tch](^) ^TchTlit) are the indicator functions such as in Ml. 

M3: in this model the propagation velocity of the convective event is estimated outside 
the model but on the specific event. In particular, we compute Ei (Charging phase), E 2 
(Mature-Dissipating phase) and V (speed of the entire event). The time weight functions 
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in Eq. become: 

fch{t) = exp I - I 11 to <tkTi< Tch 

fMa,Disit) = exp I - I ^ 1^1 Tch<tkTi<T 

Vi, V 2 ape computed dividing the distance covered by the convective event from the starting 
location (centroid of lightning hit locations in the interval [to, =t60sec]) to peak time location 
and from peak time to the ending location by the length of each time interval. V is 
calculated as the ratio between the entire distance covered by the event and its duration. 
Then, the latent process mean /r(t, p) is: 


Kt, P) = log C ^ Li,p=t 


2=1 


exp 


jbiVi) | t-r, 

V ' At 

+ C + 1 I 

i=i J 


+exp 


(b2 ^ 2 ) | t-T, .2 

V I At I 



M4: this modihcation of the model is based on single cell activity rather than on the three 
lifetime stages of lightning propagation. More specifically, for each cell, hrstly we apply 
a 3-terms moving average on the time series of lightning then we dehne active only those 
times where the number of lightning is greater than 5 and we define a cell to be active 
for 30-minutes following the last selected time to filter out local convection from moving 


convective systems. These threshold values are similar to those identified by Tuomi and 


Larjavaara (2005) over Finland. Thus, the formalization of mean component becomes: 


fi(t, p) = log 




t-Tj 

At 


T 

p) + ^ X] p) 

2 = 1 



where I[t',t"]it-,'P) is the indicator function which is 1 if either the predicted time t or the 
observed time Tj are in the phase of cell’s activity identified by the time interval [f, t”] and 
0 otherwise. We adopt both simple and quadratic exponential decay, so defining a further 
model M5: 


/i(t,p) = log ( 


Li^p * 


2 = 1 


exp 


, t-Ti 

' At 




2=1 


M6: Here we take into account lifetime phases of convective events in addition to cell’s 
activity: 
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= log Li^p* 

\ i=l 

I[0,TcH]i*)I[t\t"]it,P) + exp 

T 

+ ^ P) +1 

i=l 

where I[o,Tch](^) ^[Tch,T]i't) are the indicator functions assuming value 1 if either 
the predicted time t or the observed time Tj are in the same lifetime phase (Charging or 
Mature-Dissipating) and 0 otherwise whilst I[t',t"]it^p) is the indicator function which is 1 
if either the predicted time t or the observed time Tj are in the activity phase of the time 
interval and 0 otherwise. 



A further model variation, named memory, is obtained by changing the way the spatial 
weights are computed. Instead of considering all over the entire event duration at once, 
we consider weights along a 30-minutes time window regardless of the cell’s activity, then 


( 11 ) < 

for 30-minutes and 
( 12 ) < 


Li,p -|- I + Li-i^p -|- I Lj_i^7Vp 

Ylk=l ^k,p 

Li,p + I Li,Np + Li-i^p I Lj_i jVp + +Li-2,p + I Li-2,Np 

'l2k=l ^k,p 


for 15-minutes aggregation, respectively. This modification let us to better account for 
local features of rainy events. We expect this last variation to be more effective for the 
May event given its temporal and spatial structure. Finally, we summarize in Table all 
model modihcations we propose. 


3.2. The space-time variation. The w{t,p) component explains the residual spatial 
variation of rainfall after accounting for that included in the mean representation. In 
practice, the W component determines a random increase/decrease of the intercept and it 
describes the overall variance structure. The w{t, p) is the (t, p) element of W, a separable 
space-time random field such that w{t, p) = T(f)-|-S'(p) where the temporal part is specified 
as an autoregressive model of order 1: T{t) = aT{t — 1) -|-r 7 (t) with r/(t) ^“(0, Gp) a nd the 

spatial part is a Gaussian Conditional Autoregressive model (CAR) (Besag, 1974). The 
CAR model is characterized by a clear link between the conditional and the joint probability 
distributions and, consequently, we can use the local information to make inference on the 
random field S (details can be found for example in Besag, 1974; Cressie 1993 Banerjee 


et al., 2004). Formally the spatial random field S = {S\, ■ ■ ■ ,S^ 


pi 


, Sn) for all times t has 


a multivariate normal distribution 
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Table 8. Summary of proposed models. 


Model modifications 

Velocity 

Phases 

Pixel Activity 


Parameters 

Ml 

M2 

X 

X 

X 



9 = {a,6,a,T2 t|,ps,t2} 

= {61,62, Q!,r2,T|, PS, T^} 

M3 

X 

X 

- 

e 

M4 

- 

- 

X 


M5 

- 

- 

X 



M6 

- 

X 

X 


M2memory 

X 

X 

- 


9 = {a,6,a,T/,T|,ps,r^} 

MSmemory 

X 

X 

- 

9 

= {61,62, a, PS, T^} 


(13) S~MAr(0,cj|(I-psB)~i) 

where ag is the process variance, B is an adjacency matrix such that bij = 1 if cell i is 
a neighbor of cell j and zero otherwise, ps is coefficient measuring the strength of spatial 
interactions and it has to lie in the interval ( 1 /m, 1 ) with m the maximum number of 
neighbors in the adopted neighborhood structure. This last condition ensures that (13) is 
well defined (see for example Chapter 5 of Banerjee et al. (2004[)). 


Prior distributions. We suggest the following prior distributions for the models param¬ 
eters: a ~ cr«), ~ /nur(a^, 6 ,,); cr| ~ Invr{as,bs), ps ~ 

cr^ ~ InvT{ar,br) and for M2: o ~ r(ao,6o)) b ~ r(ai,6i) and M3: 61 ~ r(ai,6i), 
62 ~ r(ai, 6i). 

Notice that the inclusion of a 6 lightning-rain delay could be eventually inserted in the 
model. 

In our implementation we set the proposed priors as a, 6 , 6 i, 62 ~ r(0.001,0.001), for com¬ 
putational reasons we simulate using precisions instead of variances then = cr“^ and 
t| = ~ r(0.001,0.001). a ~ A^(0.5,100) and ps ~ 17(0,1/m) where m is the maxi¬ 

mum number of neighbors. Finally, = cj“^ ~ r(0. 001, 0.001). 


Estimation: predictive distribution. The Gaussian framework we adopt simplifies the build¬ 
ing of the predictive distribution of the latent process Y. Consider an unobserved location 
in space and time (tojPo), the joint distribution of (Y(to, Po), Y)|0, W is Gaussian as well 
as the conditional distribution Y(to, Po)| Y, 0, W. The predictive distribution is obtained 
integrating over the parameters set and it is analytically intractable. However we can draw 
samples from it using Monte Carlo methods. We have N* posterior samples of 6 and W, 
for each sample j (j = 1,..., N*) we simulate Yj{to, po) from 7 r(y(to, Po)| Y, 6j,'Wj) and 
use it for inferential purposes. 


4. Results 
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86 cells selected for estimation 



91 cells selected for estimation 



(a) (b) 


88 cells selected for estimation 91 cells selected for estimation 



5 - 

-1- sampled cells for estimation 



IS validation cells 










... 







(c) (d) 

Figure 5. Grid locations (a)-(b) August 5, 2004 15-minutes and 30- 
minutes time scales; (c)-(d) May 9, 2006 15-minutes and 30-minutes time 
scales. In panel (a)-(d) the selected cells for estimation (black crosses) and 
validation cells (red boxes) are mapped. 


4.1. Estimation of parameters. In the definition of the spatial weights and of the spatial 
dependence we adopt a second order nearest-neighbor structure as shown in Figure]^ A 
random sample of cells is drawn from the set of grid cells with at least one rain gage 
D = {1, • • • ,p, -'' )^} for estimation. The remaining cells of the same sub-grid are set 
aside for validation purposes (see Figure]^. A summary description of the training and 
validation sets is reported in Table 
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Table 9. Number of estimation (Est.) and validation (Val) cells and max¬ 
imum number of neighbors (m) in the spatial neighborhood structure for 
each case study. 


Case 

n 

Est(#cells) 

Val(#cells) 

m 

5Aug2004-15min 

100 

86 

14 

8 

5Aug2004-30min 

104 

91 

13 

8 

9May2006-15min 

111 

88 

23 

7 

9May2006-30min 

112 

91 

21 

7 


The model estimation is implemented in JAGS (Plummer, 2003) using the package 


R2jags (Yu-Sung and Masanao, 2012) to run the simulation within R. We run two chains 
with dispersed starting points for 20000 iterations, with a burn-in of 5000 and a thinning of 
1/20 such that we retain the last 1000 iterations of each chain for estimation. Convergence 
was inspected both graphically and from several summary statistics. Trace-plots of chosen 
parameters and summary statistics for the “best” models are presented in Figure]^ and 
and in Table 


10 


Trace-plots and the Potential Scale Reduction Factor R proposed by 
Gelman and Rubin (Gelman and Rubin, 1992) show that the MCMC converges rapidly for 
all parameters and model setting. 
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Figure 6 . Trace-plots obtained with 20000 iterations and 2 chains after a 
burn-in of 5000 and a thin of 1/10: a) M3 for August 5, 2004 at 15-minutes 
time scale, b) M3 for August 5, 2004 at 30-minutes time scale. 
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(b) 


Figure 7. Trace-plots obtained with 20000 iterations and 2 chains after 
a burn-in of 5000 and a thin of 1/10; a) M2memory for May 9, 2006 at 
15-minutes time scale, b) M2 for May 9, 2006 at 30-minutes time scale. 




































Table 10. MCMC Posterior Inference for best performance cases: M3 for August 5, 2004 at 30- 
minutes time scale and M2memory for May 9, 2006 at 15-minutes time scale. 
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Table 11. Calculation of Probability of Hits on Total (POHT), Probability 
of Detection (POD), Probability of False Detection (POFD or FAR) and 
Heidke skill score (HSS). 


Observed 




Rain 

No Rain 


Predicted 

Rain 

a 

b 

a+b 


No Rain 

c 

d 

c+d 



a+c 

b+d 

n 


Indexes 


Acronym 


Formula 


Probability of Hits on Total POHT 

Probability of Detection POD 

Probability of False Detection POFD 

Heidke Skill Score HSS 


(a + d)/n 
a /(a + c) 
b/{b + d) 

(g+d) —[(a+c)(a+fc) + (d+c)(rf+b)]/n 
n—\{a+c){a+b)-\-{d+c){d+b)\/n 


4.2. Evaluation of predictions. We use two categories of indexes to evaluate the pre¬ 
dictive performance of our modeling approach: 1) for the first category, we compute the 
Deviance Information Criterion (DIC) for evaluating model fit and the Empirical Cov¬ 
erage (EC) of 90% credible intervals for evaluating the credibility of simulation results; 
2 ) for the second category, we adopt the same validation scheme used for meteorological 
variables (see for example Murphy and Daan (1985) or Thornes and Stephenson (2001)) 
to analyze the quality of predictions over the validation set. In particular, we use Root 
Mean Square Errors (RMSE) to measure the mean error when considering predictions and 
observations as continuous variables; Probability of Hits on Total (POHT), Probability of 
Detection (POD) and Probability of False Detection (POFD) when classifying events as 
dichotomous variables into rain/no rain, that is [0,0.2), [0.2, oo) and, finally, Heidke skill 
score (HSS) always for the same two categories. To facilitate the understanding of indexes 
role we report their definition in Table 11 following iBarnes et al. (2009) notation. POHT 


is a measure of accuracy since it is the percentage of cases correctly predicted in both 
classes Rain and No Rain] POD is the rate correctly predicted Rain events and POFD, 
also known as False Alarm Ratio (FAR), is the rate of events mistakenly predicted as rainy. 
It is worth considering POHT when predictions of Rain and No Rain events have, as in the 
present study, the same relevance. On the other hand, POD and POFD are focused on the 
evaluation of predictions in the Rain class. Nevertheless, a joint analysis of these indexes 


is generally recommended (Barnes et ah, 2009) to fully understand model’s performances. 


Finally, HSS measures the accuracy of the model in predicting the correct category with 
respect to results due to random chance. HSS ranges from —oo to 1: 0 indicates no skill, 
1 is the perfect score. 

The values of DIC, POHT, POD, POFD, RMSE and HSS obtained from predictions of 


August 5, 2004 and May 9, 2006 are reported in Table [1^ and [1^ respectively. 

Recall from Section |2.3| that the two case study differ substantially in terms of both 
spatial extent and time duration. Furthermore, they have a different physical structure 
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Table 12. Evaluation of MCMC estimates for the events August 5, 2004 
15-minutes and August 5, 2004 30-minutes (“best” model in bold). 


5-Aug-2004-15min 

DIG 

EG(%) 

POHT(%) 

POD(%) 

POFD(%) RMSE(mm) 

HSS 

Ml 

3606.7 

92.9 

73.4 

84.9 

33.2 

0.596 

0.474 

M2 

3115.9 

93.2 

75.0 

82.2 

29.2 

0.546 

0.496 

M3 

3097.4 

93.3 

75.4 

83.2 

29.2 

0.543 

0.505 

M4 

3566.3 

92.8 

74.4 

84.9 

31.7 

0.604 

0.491 

M5 

3487.9 

92.9 

74.8 

80.5 

28.5 

0.595 

0.490 

M6 

3613.1 

92.9 

73.8 

84.9 

32.6 

0.606 

0.481 

5-Aug-2004-30min 

Ml 

2165.3 

92.6 

75.6 

87.8 

33.1 

0.575 

0.522 

M2 

2035.1 

92.9 

76.9 

85.7 

29.4 

0.524 

0.543 

M3 

2005.3 

92.6 

79.1 

87.8 

27.2 

0.529 

0.585 

M4 

2227.3 

92.4 

77.8 

78.6 

22.8 

0.567 

0.550 

M5 

2253.4 

92.5 

80.8 

82.7 

20.6 

0.548 

0.611 

M6 

2307.3 

92.7 

77.4 

79.6 

24.3 

0.578 

0.543 

Table 13. 

Evaluation of MGMG estimates for the events May 9, 2006 15- 

minutes and May 9, 

2006 30-minutes (“best” model in bold) 



9-May-2006-15min 

DIG 

EG(%) 

POHT(%) 

POD(%) 

POFD(%) 

RMSE(mm) 

HSS 

Ml 

2160.7 

91.4 

74.2 

56.3 

20.8 

0.424 

0.319 

M2 

2158.4 

91.3 

75.0 

56.6 

19.9 

0.424 

0.333 

M3 

2168.0 

91.4 

74.9 

57.5 

20.2 

0.419 

0.336 

M4 

3803.4 

92.7 

69.6 

58.4 

27.3 

0.448 

0.257 

M5 

3759.5 

92.6 

69.6 

58.4 

27.3 

0.448 

0.258 

M6 

2543.0 

91.6 

73.0 

56.0 

22.3 

0.435 

0.298 

9-May-2006-30min 

Ml 

3203.9 

91.0 

69.9 

67.9 

29.2 

0.560 

0.353 

M2 

3167.8 

91.4 

69.2 

66.5 

29.6 

0.557 

0.337 

M3 

3173.6 

91.3 

68.5 

65.1 

30.0 

0.564 

0.321 

M4 

4041.2 

91.3 

66.4 

64.7 

32.9 

0.641 

0.286 

M5 

4017.4 

91.5 

65.1 

67.0 

35.7 

0.632 

0.276 

M6 

3994.5 

91.3 

66.0 

67.4 

34.7 

0.626 

0.290 


since the August event is formed by two storms merging near the Appenini Mountain: one 
is a propagating system moving from west to east and the other is a thermo-convective 
event generated by the orography, whilst the May event is unique and propagates linearly 
from west to east . 

The empirical coverage of all models is close to the nominal value of 90%, suggesting a 
very good accuracy level of the proposed modeling approach. Through the DIG we select 
the best fitting model for each situation, which is M3 for both time scale in the August 
2004 event and M2 for the May 2006 event. In practice, adding the velocity of propagation 
into the model (M2, M3) improves the predictive capacity. The simpler Ml model for 
the May event at 30-minutes time scale returns a slightly larger DIG but an equivalent 
and some times better performance in terms of verification indexes, suggesting that in this 
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Table 14. Comparison of best performing models and their version with 
spatial weights given in © and ([T^. 



DIG 

EC(%) 

POHT(%) 

POD(%) 

POFD(%) 

RMSE(mm) 

HSS 

5-Aug-2004 

MSmemory-15min 

3105.2 

93.5 

74.6 

84.9 

31.3 

0.536 

0.495 

M3memory-30min 

1987.3 

92.6 

77.8 

87.8 

31.7 

0.531 

0.561 

9-May-2006 

M2memory-15min 

1787.4 

91.1 

76.7 

56 

17.6 

0.420 

0.360 

M3memory-30min 

2940.0 

91.4 

69.9 

66.1 

28.4 

0.545 

0.346 


situation the inclusion of the storm propagation speed in the model does not significantly 
improve its predictive capacity. The use of two different values for V, one for the Charging 
phase and the other for the Mature-Dissipating phase is useful only for the August 2004 
event, suggesting that this differentiation is relevant for the prediction and analysis of large 
storms. Furthermore, model modifications M4, M5 and M6 (based on the relation between 
cells with lightning activity) do not return better predictions. 

The RMSE ranges from 0.419 mm up to 0.614 mm that in absolute terms suggests an 
average error of the same magnitude as the rain gages error. In relative terms we can 
evaluate this error in relation to the maximum recorded in each situation as a measure of 
event size. Considering only the best models performances again we appreciate a better 
result with larger events with errors ranging from 0.9% (August at 30-minutes) to 1.6% 
(August at 15-minutes), while in the May event errors range from 2.8% (15-minutes) to 
3.2% (30-minutes). HSS suggests a reasonable performance of the model for the largest 
event of August and a poor overall performance for the smaller, more complex. May storm. 
A deeper analysis of the performance indexes (POD, POFD, POHT) highlights the model 
difficulty in predicting extreme rain amounts (zero and peaks). One possible reason for 
this behavior is the chosen spatial structure that takes into account eight cells around each 
predicting cell for a total area of 30 x 30 km^. The latter maybe a too large area when a 
high rain rate is recorded in one cell. Convective systems can show very different rain rates 
during their development. In particular, the larger the rain rate in a cell, the smaller the 
cells cluster with non zero rainfall. In view of the last consideration we estimated the same 
models with a smaller neighborhood structure (maximum number of neighbors 4) but no 
consistent improvement was obtained. Then we applied the spatial weights modifications 
given in © and ( |12[ ) to the first three best performing models for both events, the best 
models results are shown in Table 14 The modification of spatial weights improves the 
model predictive performance for the May 9, 2006 event whereas it seems to return no 
clear advantages for the Aug 5, 2004 event. 

Finally, we select the best performance models including those with memory variant: 
evaluation of MCMC estimates are reported in Table [Tsl 

Model M3 shows an acceptable performance for both time scales in the August event. 
Only a tendency to underestimate the number of zero observation is highlighted by the 
POFD values. Time dynamic is well captured by the best models almost everywhere with 
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Table 15. Evaluation of MCMC estimates for the events 5-Aug-2004- 
15/30min and 9-May-15/30min, Only the best models are reported. 



DIG 

EC(%) 

POHT(%) 

POD(%) 

POFD(%) 

RMSE(mm) 

HSS 

5-Aug-2004 

15-min: M3 

3097.4 

93.3 

75.4 

83.2 

29.2 

0.543 

0.505 

30-min: M3 

2005.3 

92.6 

79.1 

87.8 

27.2 

0.529 

0.585 

9-May-2006 

15-min: M2memory 

1787.4 

91.1 

76.7 

56 

17.6 

0.420 

0.360 

30-min: M2 

3167.8 

91.4 

69.2 

66.5 

29.6 

0.557 

0.337 


few exceptions corresponding to extreme recordings. As an example, we report “best” and 
“worst” estimated values, the predicted (black-crossed) series together with the observed 
(blue line) ones by M3 in August at 30-minutes (panel (a) of Figure]^ and by M2memory 
in May at 15-minutes (panel (b) of Figure together with the 90% credibility intervals 
(red lines). In the plots the correlation between observed and predicted values is reported. 
The worst performances are always shown during the smaller event (May) regardless the 
time scale. The main issue seems to be the prediction of sudden changes in the series, while 
the prediction of other peaks is almost always correct in both their location (in space and 
time) and size. 

In summary results obtained varying the neighborhood dependence structure or allow¬ 
ing for one time step delay (not shown) in the lightning-rainfall relation show that these 
structure changes seems to be not clearly influential on estimates quality. On the other 
hand, some improvement is observed when limiting the effect of neighborhood structure 
to the 30-minutes before the observed time, in particular a DIG reduction and a slight 


general improvement for May 9, 2006 event (see Table 14). This improvement is due to 
the fact that this modification introduces a limited memory on the spatial relation helping 
to account for the slow movement of the entire system along its 16 hours of duration. On 
the contrary, an unlimited version of spatial weights, such as in M3 is best performing for 
more rapid August event (9 hours of duration). 


We are interested in verifying if any improvement is obtained by our best models pre¬ 
dictions on GSMap satellite estimates (Ushio et ah, 2009). Since temporal resolution of 


the latter is 1-hour, we aggregate model estimates to the same time scale. We use rain 
gages observation as benchmark for the comparison taking into account those cells where 
rain gages are present computing 1-hour cumulated rain. We compare residuals distribu¬ 
tions (the difference between model output and rain gage recordings panel (a) in Figure 
and [Io| ) , the spatial distribution of the difference between estimated events volume and 
observed one, where volumes are obtained by integrating over time in each cell both predic¬ 
tions and rain gauge recordings and predicted versus observed performance (panels (c)-(d) 
in Figure and [Tol ) , together with predicted versus observed scatter plots shown in panel 
(d) of Figure]^ and 10). 

The median residuals for August event at 30-minutes time scale, are 0.00 and 0.20 for M3 
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Cell 615 



(a) August best, corr= 0.92 


Cell 415 



Cell 938 



30-rriin time steps 


(b) August worst, corr= 0.28 


Cell 574 



15-mir time steps 


(c) May best, corr= 0.80 (d) May worst, corr= 0.23 

Figure 8. Time series of rainfall predicted values at selected validation 
cells for M3 Aug 5, 2004 at 30-minutes time scale ((a) best and (b) worst) 
and M2memory May 9, 2006 at 15-minutes time scale ((c) best and (d) 
worst). Predicted values in black line, observed values blue line, 90% cred¬ 
ibility intervals red line. 


and GSMap respectively, their variances strongly differ as shown in Figure]^ (a) (variances 
are 18.48 for M3 and 25.22 for GSMap) suggesting more accurate predictions returned 
by M3. As far the May event at 15-minutes time scale is concerned both models return 
a zero median residual, but again GSMap produces more variable estimates: variance of 
M2memory 4.44, variance of GSMap 7.74 (see Figure 10 panel (a)). 
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The spatial distribution (Figure j^b) and (c)) suggest a better predictive performance of 
the proposed model than the GSMap, indeed most M3 residuals are closer to zero than 
those of GSMap. Moreover, notice that both M3 and GSMap show a slight overestima¬ 
tion that seems stronger for GSMap. The log-predicted versus log-observed plot highlights 
how M3 better captures larger values than GSMap while both have a similar behavior in 
predicting small rain amounts (Figure panel (d). Similar considerations can be done 
when comparing M2memory and GSMap for the May event at 15-minutes, again a better 
performance of the proposed modeling approach with respect to GSMap (Figure [Io| panel 

(d)). 


5. Discussion and Concluding remarks 

The main goal of this work is to build a working protocol that allows to isolate rainy 
events and to predict the accumulated precipitation at unobserved locations and times 
using lightning counting. The first contribution is a simple and effective scan statistic 
procedure used for the identification of single storms among several severe meteorological 
events. This procedure let us to identify 767 convective events in Central Italy during 
the period March-September of 2003-2006 from which we estimate the Rainfall Lightning 
Ratio. As a consequence of the identihcation process, we are also able to delineate the life¬ 
time phases of a convective event using the beginning, peak and ending times of lightning 
temporal evolution to separate them. This aspect represents an advantage of our approach 
since it let us to incorporate into the model the different features of rainfall propagation in 
time assuming different weight structures in the equation which converts lightning into rain. 
Furthermore, as we have a large number of identihed events that we can reliably classify 
distinguishing among Small, Medium and Large events in terms of number of lightning. 
This allows us to estimate a different RLR for each event size. The validation of the scan 
statistics procedure is fully carried out for 7 events and we believe it requires some further 
investigation using clouds micro-physical features as a validation tool. 

The two convective events of August 5, 2004 and May 9, 2006 chosen for testing our 
model are both classified in the category of Large events. This fact have determined the 
adoption of the same value of RLR in the estimation of rainfall predictions. Nevertheless, 
the phenomenological features of the two convective events are substantially different from 
one another both in lightning intensity and in rainfall rates. In particular, the rain rates 
and volumes of the August 5, 2004 are larger than those of the May 9, 2006 event as well 
as the lightning intensity. The different models predictive performance may depend on the 
differing events characteristics. 

We estimate our model in 7 different modifications depending on the type of temporal 
and spatial weights adopted in the model mean description; three versions are based on 
lifetime phases of lightning propagation (Ml,M2 and M3); two versions are based on single 
cell activity (M4, M5); one version rely on both lifetime phases and cell activity (M6) 
and one version include a different spatial weights specification (memory). Modifications 
have consequences on model application since the two versions based on cell activity and 
the memory one can be potentially adapted in near-real time services when combined 
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1 hour aggregation of30-tnir space-time predictions 




log(observed+1) 

(d) 

Figure 9. A comparison of M3 predictions and GSMap satellite estimates 
for the event August 5, 2004 (30-minutes time scale): (a) estimate of errors 
probability density function blue for M3 and red for GSMap) with respect 
to rain gage observations at 1-hour time resolution; (b) and (c) the errors 
spatial distribution obtained by summing over the entire duration of the 
event for M3 and GsMap, respectively; (d) log predicted versus log-rain 
gage observations at 1-hour time resolution (M3 blue, GSMap red). 
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(a) 




0 12 3 4 


log(observed+1) 

(d) 

Figure 10. A comparison of M2memory predictions and GSMap satellite 
estimates for the event May 9, 2006 (15-minutes time scale): (a) estimate 
of errors the probability density function (blue for M2memory and red for 
GSMap) with respect to rain gage observations at 1-hour time resolution; 
((b) and (c) the errors spatial distribution obtained by summing over the 
entire duration of the event for M2memory and GSMap, respectively; (d) 
log predicted versus log rain gage observations at 1-hour time resolution 
(M2memory blue, GSMap red) 
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as they do not depend on knowledge of the entire storm behavior. The others are ex¬ 
post estimation models, that can be successfully used in improving historical databases. 
In particular estimates in midlatitude areas may benefit from our approach. Predictive 
intervals have empirical coverage closer to the nominal values of 90% for all the models 
confirming the accuracy of estimation of our modeling approach. Moreover, it appears 
that the proposed model is able to capture the time dynamic in all considered situations, 
remarking that peaks are better predicted when applied at 15-minutes aggregation of event 
May 9, 2006. In general, our model shows a better performance in fitting the data when 
applied on August 5, 2004 particularly at 30-minutes aggregation. In terms of RMSE, we 
can conclude that estimates are more reliable for the more intense event of August 5, 2004 
reinforcing the intuition that RLR based estimates work better for intense events. The 
evaluation done using Probability of Hits on Total, Probability of Detection, Probability 
of False Detection and HSS confirm that larger events are better predicted. 

On the whole, our modeling approach shows a good capability in capturing time dy¬ 
namic although it is unable to capture sudden changes in the rainfall series, that is a 
natural drawback of using a stationary model for time dynamic description. Another rele¬ 
vant point is that rainfall data are typically zero inflated then a model modification, such 
as using a mixture between a distribution that specifies the probability of having positive 
precipitation and a probability density function for the rainfall accumulation may improve 


the prediction (for instance Berrocal et ah, 2008). Further developments will include the 


adoption of a non separable space-time variability structure, that although will add com¬ 
putational complexity it might considerably improve the quality of prediction. 
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